Problem Statement¶
Business Context¶
A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.
Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.
Objective¶
SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.
To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.
Data Description¶
The data contains the different attributes of the various products and stores. The detailed data dictionary is given below.
- Product_Id - unique identifier of each product, each identifier having two letters at the beginning followed by a number.
- Product_Weight - weight of each product
- Product_Sugar_Content - sugar content of each product like low sugar, regular and no sugar
- Product_Allocated_Area - ratio of the allocated display area of each product to the total display area of all the products in a store
- Product_Type - broad category for each product like meat, snack foods, hard drinks, dairy, canned, soft drinks, health and hygiene, baking goods, bread, breakfast, frozen foods, fruits and vegetables, household, seafood, starchy foods, others
- Product_MRP - maximum retail price of each product
- Store_Id - unique identifier of each store
- Store_Establishment_Year - year in which the store was established
- Store_Size - size of the store depending on sq. feet like high, medium and low
- Store_Location_City_Type - type of city in which the store is located like Tier 1, Tier 2 and Tier 3. Tier 1 consists of cities where the standard of living is comparatively higher than its Tier 2 and Tier 3 counterparts.
- Store_Type - type of store depending on the products that are being sold there like Departmental Store, Supermarket Type 1, Supermarket Type 2 and Food Mart
- Product_Store_Sales_Total - total revenue generated by the sale of that particular product in that particular store
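To make the dictionary concrete, here is a single illustrative record built entirely from made-up values (the actual file is loaded later in the notebook):

```python
import pandas as pd

# One hypothetical record matching the data dictionary above (all values invented)
sample = pd.DataFrame([{
    "Product_Id": "FD6114",               # two letters followed by a number
    "Product_Weight": 12.66,
    "Product_Sugar_Content": "Low Sugar",
    "Product_Allocated_Area": 0.056,      # share of the store's display area
    "Product_Type": "Frozen Foods",
    "Product_MRP": 117.08,
    "Store_Id": "OUT004",
    "Store_Establishment_Year": 2009,
    "Store_Size": "Medium",
    "Store_Location_City_Type": "Tier 2",
    "Store_Type": "Supermarket Type2",
    "Product_Store_Sales_Total": 3402.0,  # target variable
}])
print(sample.shape)  # → (1, 12)
```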
Installing and Importing the necessary libraries¶
#Installing the libraries with the specified versions
!pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 matplotlib==3.10.0 seaborn==0.13.2 joblib==1.4.2 xgboost==2.1.4 requests==2.32.3 huggingface_hub==0.30.1 -q
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab) and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.
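If you want to confirm the pins took effect after restarting, a quick sanity check can help; this snippet only spot-checks a few of the packages, with names and versions mirroring the install command above:

```python
from importlib.metadata import version, PackageNotFoundError

# Spot-check a few of the pinned packages from the install cell above
pinned = {"numpy": "2.0.2", "pandas": "2.2.2", "scikit-learn": "1.6.1"}

report = {}
for pkg, expected in pinned.items():
    try:
        report[pkg] = version(pkg)
    except PackageNotFoundError:
        report[pkg] = None
    status = "OK" if report[pkg] == expected else f"expected {expected}"
    print(f"{pkg}: {report[pkg]} ({status})")
```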
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# For splitting the dataset
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)
# Libraries for different ensemble regressors
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
# Libraries to get different regression metric scores
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    mean_absolute_percentage_error,
)
# To create the pipeline
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline,Pipeline
# To tune different models and standardize
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler,OneHotEncoder
# To serialize the model
import joblib
# os related functionalities
import os
# API request
import requests
# for hugging face space authentication to upload files
from huggingface_hub import login, HfApi
Loading the dataset¶
#Loading the data from google drive
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/Datasets/SuperKart.csv')
#Making a copy of the data
data = df.copy()
Data Overview¶
# Taking a look at the first 5 rows of data
data.head()
#Taking a look at the last 5 rows of data
data.tail()
# Looking at the shape of the dataset
data.shape
- There are 8763 rows and 12 columns in the dataset
Checking the data types¶
#Checking the datatypes for each of the variables
data.info()
- We have 7 object variables, 4 floats, and 1 integer. It seems from a glance that we don't have any missing values in the dataset as there are 8763 non-null values within each column.
Statistical Summary - Numeric¶
# Getting the statistical summary of the numeric data in the dataset
data.describe()
Product_Weight:
- The mean (12.65) and 50th percentile (12.66) are very close, suggesting a roughly normal distribution
- Minimum is 4, Max is 22, and the standard deviation is 2.21, suggesting outliers on both ends
Product_Allocated_Area:
- The mean (0.068) and 50th percentile (0.056) suggest the distribution is right-skewed, meaning SuperKart gives a lot of retail space to only a few items
- Minimum is 0.004, Max is 0.298, and standard deviation is 0.048 - suggesting many outliers on the larger end of the distribution
Product_MRP:
- The mean (147.03) and 50th percentile (146.74) suggest a normal distribution
- Min is 31, Max is 266, and std is 30.69, suggesting outliers on both ends
Store_Establishment_Year: (our only integer value)
- The mean (2002) and 50th percentile (2009) suggest a left-skewed distribution, meaning newer stores account for more rows than older ones
- Min is 1987, Max is 2009, and std is ~8. Interestingly, the median and Max are equal, confirming that at least half of the rows come from stores established in 2009, with most of the rest from stores built after 2001.
Product_Store_Sales_Total: (our target variable)
- The mean (3464) and 50th percentile (3452) suggest a normal distribution of values
- Min is 33 (possibly a data issue: the minimum Product_MRP is 31, so one particular store may have sold only about one unit of a low-priced item), Max is 8000, and std is 1065, suggesting outliers on both ends. Most product-store combinations generate between ~2400 and ~4500 in revenue.
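The "outliers on both ends" reads above can be checked numerically with the usual 1.5×IQR whisker rule. A minimal sketch on a toy series (in the notebook you would pass `data["Product_Weight"]` or any other numeric column instead):

```python
import pandas as pd

def iqr_outlier_bounds(s: pd.Series):
    """Return the (lower, upper) whisker bounds under the 1.5*IQR rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Toy stand-in for a column such as data["Product_Weight"]
weights = pd.Series([4, 11, 12, 12.5, 13, 14, 22])
low, high = iqr_outlier_bounds(weights)
outliers = weights[(weights < low) | (weights > high)]
print(low, high, outliers.tolist())  # → 8.5 16.5 [4.0, 22.0]
```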
Statistical Summary - Categorical¶
# Getting the statistical summary of the categorical data
data.describe(include='object')
Product_Id:
- As pointed out in the data descriptions, this is unique to each item with 8763 values and 8763 unique values
Product_Sugar_Content:
- This variable contains 4 unique values
- "Low Sugar" is the most frequent value with over half of all values (~55%)
Product_Type:
- There are 16 unique values within Product Type
- "Fruits and Vegetables" is the most common with about 14% of all sales
Store_Id:
- There are 4 unique Store Ids
- "OUT004" is the most frequent at ~53% of all values - suggesting it carries more products than the other 3 stores combined.
Store_Size:
- There are 3 unique values and from the data descriptions, I assume High, Medium, and Low
- "Medium" is the most frequent at ~68% of all values
Store_Location_City_Type:
- As the data description pointed out, there are 3 unique values - Tier 1, 2 and 3
- "Tier 2" is the most frequent at ~71% of all values
Store_Type:
- As the data description pointed out, there are 4 unique values
- "Supermarket Type2" is the most frequent with ~53% of all values
Duplicated Value Check¶
# Checking the data for duplicate values
data.duplicated().sum()
- There are no duplicate rows in the dataset.
Missing Value Check¶
# Checking for missing values in the dataset
data.isnull().sum()
- There are also no missing values in the dataset, so we are clear to proceed without treatment.
Exploratory Data Analysis (EDA)¶
Univariate Analysis¶
Utility Functions¶
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default False)
    n: displays the top n category levels (default None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x position: center of the bar
        y = p.get_height()  # y position: top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate each bar
    plt.show()  # show the plot
Numeric Data Exploration¶
Product_Weight¶
# Plotting the distribution of Product Weight
histogram_boxplot(data, "Product_Weight")
- As the statistical summary showed, the distribution is normal with 50% of the products having a weight between ~11 and 14. There are multiple outliers on both ends as well.
Product_Allocated_Area¶
# Plotting the distribution of Product Allocation Area
histogram_boxplot(data, "Product_Allocated_Area")
- Again, true to the statistical summary, we see that the distribution is highly right-skewed: 75% of the products get less than 0.10 of the display area, 25% get between 0.10 and 0.18, and multiple outliers get 0.20 or more in retail space.
Product_MRP¶
# Plotting the distribution of Product MRP
histogram_boxplot(data, "Product_MRP")
- The distribution for Product Max Retail Price is approximately normal. 50% of the products are priced between ~130 and 180, with a few outliers below ~75 and above ~230.
Store_Establishment_Year¶
# Plotting the distribution of Store Establishment Year
histogram_boxplot(data, "Store_Establishment_Year")
- It seems all 4 stores were built in different years ranging from 1987 - 2009.
data.Store_Establishment_Year.unique()
Product_Store_Sales_Total¶
# Plotting the distribution of Product_Store_Sales_Total
histogram_boxplot(data, "Product_Store_Sales_Total")
- The distribution is normal, with a mean of ~ 3500. There are multiple outliers on both sides of the distribution.
Categorical Data Exploration¶
As Product_Id is a unique value for each item carried, we will forego plotting that variable
Product_Sugar_Content¶
# Plotting the distribution of Product_Sugar_Content
labeled_barplot(data, "Product_Sugar_Content", perc=True)
- As seen earlier, there are 4 unique values and Low Sugar makes up more than half of all values in the distribution.
Product_Type¶
# Plotting the distribution of Product_Type
labeled_barplot(data, "Product_Type", perc=True)
- There are 16 unique values in the dataset. Fruits and Vegetables is the most common with ~14% of all values. Snack Foods is also quite common at ~13%.
Store_Id¶
# Plotting the distribution of Store_Id
labeled_barplot(data, "Store_Id", perc=True)
- There are 4 unique values in the Store_Id variable of the dataset with OUT004 being by far the most common - making up over half of all values.
Store_Size¶
# Plotting the distribution of Store_Size
labeled_barplot(data, "Store_Size", perc=True)
- This variable consists of 3 unique values - Small, Medium, and High. Medium is the most common at around ~68% of all values.
Store_Location_City_Type¶
# Plotting the distribution of Store_Location_City_Type
labeled_barplot(data, "Store_Location_City_Type", perc=True)
- There are 3 unique values within this variable - Tier 1, 2, and 3. The vast majority of products carried/sold are in stores located in cities with a medium standard of living (Tier 2), at ~71% of all values.
Store_Type¶
# Plotting the distribution of Store_Type
labeled_barplot(data, "Store_Type", perc=True)
- There are 4 unique values within this variable with "Supermarket Type2" making up over half of all values at ~53%.
- Also of note, these values are the exact same values given by Store Id.
Bivariate Analysis¶
Correlation Matrix for Numeric Variables¶
# Adding my numeric variables to a list
num_vars = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
# Plotting the heatmap
plt.figure(figsize=(10, 7.5))
sns.heatmap(data[num_vars].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
- There are very few negatively correlated values in the dataset.
- There are no values that are so highly correlated that we can afford to drop any as features.
- The two most highly correlated variables are Product_MRP and our target variable, Product_Store_Sales_Total. I expect Product_MRP may be one of our more influential variables.
- Also of note, Product_Weight is just behind Product_MRP in terms of correlation with our target variable. It too will have a heavy influence.
- Not surprisingly, Product_Weight and Product_MRP show a moderately strong correlation with each other.
- A preliminary read: the more that large, high-priced items sell, the more money the stores make.
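A compact way to rank features by correlation with the target, shown here on a toy frame with invented values (in the notebook, `data[num_vars]` would be used in place of `df`):

```python
import pandas as pd

# Toy numeric frame standing in for data[num_vars] (values invented)
df = pd.DataFrame({
    "Product_MRP": [100, 150, 200, 250],
    "Product_Weight": [10, 12, 15, 14],
    "Product_Store_Sales_Total": [2000, 3200, 4500, 5100],
})

# Correlation of every numeric column with the target, strongest first
ranked = (
    df.corr()["Product_Store_Sales_Total"]
    .drop("Product_Store_Sales_Total")
    .sort_values(ascending=False)
)
print(ranked.index.tolist())  # → ['Product_MRP', 'Product_Weight']
```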
Distribution of Numeric Variables wrt Target Variable¶
# Removing the target variable from our numeric variables list
num_vars.remove('Product_Store_Sales_Total')
# Setting up a nice-size horizontal plot
plt.figure(figsize=(20, 5))
# Looping through the variables in num_vars list and plotting a scatterplot with Product_Store_Sales_Total
for i, var in enumerate(num_vars):
    plt.subplot(1, len(num_vars), i + 1)
    sns.scatterplot(x=var, y='Product_Store_Sales_Total', data=data)
plt.show()
- As we gathered from the correlation matrix, there is a positive linear relationship between Product_Weight and Product_MRP with Product_Store_Sales_Total.
- There isn't really any clear relationship between Product_Allocated_Area and Store Sales other than to see that most of the store sales are coming from products under ~0.20 allocated area.
- Interestingly, stores built in 1987 and 2009 have similar, average performance in terms of Store Sales, but stores built in 1998 are the lowest performing while stores built in 1999 are the highest performing stores.
- The single floating point at the very top of Product Weight v. Sales, Allocated Area v. Sales, and MRP v. Sales, carried only in the store established in 1999, is interesting. Let's try to isolate it to find out what it is.
# Attempting to isolate the floating outlier by filtering our data accordingly
isolation_df = data[(data['Product_Weight'] > 20) & (data['Store_Establishment_Year'] == 1999) & (data['Product_MRP'] > 250)]
isolation_df.shape
isolation_df
- This singular item represents the maximum value in our original dataset in three separate variables, Product_Weight, Product_MRP, and Product_Store_Sales_Total.
- There is also a chance that this singular item is contributing to the 1999 store performing so well.
- We'll keep this in our back pocket to see if it becomes more important later. It could be that this is a major contributor to a very specialized store, or it could be an outlier that needs to be removed.
- For now, let's investigate how many other items have similar performance - specifically, those responsible for pulling in at least 30 times their Product_MRP in Store Sales Total
# Creating a dataframe where Product_Store_Sales_Total is at least 30 times the Product_MRP
power_item_df = data[data['Product_Store_Sales_Total'] >= 30 * data['Product_MRP']]
power_item_df
- NICE! We've identified 320 items in the dataset that are "golden geese" - each responsible for pulling in at least 30 times their Product_MRP in Store Sales. We'll keep this in our back pocket and apply it in feature engineering.
# Getting the unique values of Product_Id's of Power Items in order to list them later when converting inputs from the Streamlit app
power_item_df['Product_Id'].unique()
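When we revisit this in feature engineering, the ratio can be collapsed into a simple binary flag. A sketch on toy data (the real version would operate on `data` with the same 30x threshold; the `Power_Item` column name is just a working choice):

```python
import pandas as pd

# Toy stand-in for the relevant columns of `data` (values invented)
df = pd.DataFrame({
    "Product_MRP": [100.0, 50.0, 200.0],
    "Product_Store_Sales_Total": [3500.0, 900.0, 6200.0],
})

# Flag "power items": revenue at least 30x the product's MRP
df["Power_Item"] = (df["Product_Store_Sales_Total"] >= 30 * df["Product_MRP"]).astype(int)
print(df["Power_Item"].tolist())  # → [1, 0, 1]
```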
Distribution of Categorical Variables wrt Target Variable¶
# Adding a utility function here to plot categorical variables wrt Product_Store_Sales_Total
def cat_bivariate_barplot(feature):
    plt.figure(figsize=(8, 5))
    plt.xticks(rotation=90)
    # Group the data and calculate the sum of sales for each category
    grouped_data = (
        data.groupby([feature], as_index=False)["Product_Store_Sales_Total"]
        .sum()
        .sort_values(by="Product_Store_Sales_Total", ascending=False)
    )
    # Calculate percentages
    total_sales = grouped_data["Product_Store_Sales_Total"].sum()
    grouped_data["Percentage"] = (grouped_data["Product_Store_Sales_Total"] / total_sales) * 100
    # Plot the barplot using the grouped data
    ax = sns.barplot(
        x=feature,
        y="Product_Store_Sales_Total",
        data=grouped_data,
        order=grouped_data[feature],
        palette="Paired",
    )
    ax.set_xlabel(feature)
    ax.set_ylabel("Revenue")
    # Add percentage labels
    for i, row in enumerate(grouped_data.itertuples()):
        ax.text(i, row.Product_Store_Sales_Total, f"{row.Percentage:.1f}%", ha="center", va="bottom")
    # Add a title to the graph
    plt.title(f"Distribution of {feature} wrt Product_Store_Sales_Total")
    plt.show()
Product_Sugar_Content¶
# Plotting the distribution of Product_Sugar_Content wrt Target Variable
cat_bivariate_barplot('Product_Sugar_Content')
- This distribution looks remarkably similar to our univariate distribution for Product Sugar Content, which is not too surprising.
- Low Sugar items are contributing the most to Store Sales, while Regular is the runner up.
Product_Type¶
# Plotting the distribution of Product_Type wrt our Target Variable
cat_bivariate_barplot('Product_Type')
- Again, this looks very much like the Univariate Distribution from Product_Type.
- Fruits and Vegetables and Snack Foods are the largest contributors.
- Baking Goods, Canned, Dairy, Frozen Foods, Health and Hygiene, Household, Meat, and Soft Drinks are all medium-level contributors to Store Sales
- Breads, Breakfast, Hard Drinks, Others, Seafood, and Starchy Foods are all low contributors to Store Sales.
- In the Data Preprocessing stage, we'll group the Product_Types by performance.
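One way that grouping could look, using the High/Medium/Low tiers observed above (the exact assignment is a modeling choice, not something fixed by the data; `revenue_tier` is a working name):

```python
import pandas as pd

# Tier membership based on the revenue contributions noted above
high = {"Fruits and Vegetables", "Snack Foods"}
low = {"Breads", "Breakfast", "Hard Drinks", "Others", "Seafood", "Starchy Foods"}

def revenue_tier(product_type: str) -> str:
    """Collapse the 16 product types into High/Medium/Low revenue tiers."""
    if product_type in high:
        return "High"
    if product_type in low:
        return "Low"
    return "Medium"  # Baking Goods, Canned, Dairy, Frozen Foods, etc.

# Toy stand-in for data["Product_Type"]
s = pd.Series(["Seafood", "Dairy", "Snack Foods"])
print(s.map(revenue_tier).tolist())  # → ['Low', 'Medium', 'High']
```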
Store_Id¶
# Plotting the distribution of Store_Id wrt our Target Variable
cat_bivariate_barplot('Store_Id')
- Like the Univariate distribution, we see OUT004 as the major contributor to Store Sales
Store_Size¶
# Plotting the distribution of Store_Size wrt our Target Variable
cat_bivariate_barplot('Store_Size')
- Medium-sized stores are the largest contributor to Store Sales, followed by High and ending with Small.
Store_Location_City_Type¶
# Plotting the distribution of City Type wrt our Target Variable
cat_bivariate_barplot('Store_Location_City_Type')
- Tier2 cities are the largest contributor to Store Sales, followed by Tier1 and 3.
Store_Type¶
# Plotting the distribution of Store_Type wrt our Target Variable
cat_bivariate_barplot('Store_Type')
- Supermarket Type2 is the largest contributor to Store Sales, followed by Supermarket Type1, Departmental Store, and Food Mart.
- Also of note here, these values are an exact match of the values given by Store_Id, making these two variables redundant. We'll remove one of them in Data Preprocessing.
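The suspected redundancy is easy to confirm programmatically: if every Store_Id maps to exactly one Store_Type, one of the two columns carries no extra information. A sketch on toy data (the notebook version would group `data` instead):

```python
import pandas as pd

# Toy stand-in: each store id paired with a single store type
df = pd.DataFrame({
    "Store_Id": ["OUT001", "OUT001", "OUT002", "OUT004"],
    "Store_Type": ["Supermarket Type1", "Supermarket Type1", "Food Mart", "Supermarket Type2"],
})

# The one-to-one mapping holds if every Store_Id has exactly one Store_Type
types_per_store = df.groupby("Store_Id")["Store_Type"].nunique()
redundant = bool((types_per_store == 1).all())
print(redundant)  # → True
```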
Exploring Other Relationships¶
Product_Type wrt Product_Weight¶
def type_vs_boxplot(feature):
    plt.figure(figsize=(14, 5))
    sns.boxplot(data=data, x='Product_Type', y=feature, hue='Product_Type')
    plt.xticks(rotation=90)
    plt.title(f'Distribution of {feature} by Product Type')
    plt.xlabel('Product Type')
    plt.ylabel(feature)
    plt.show()
# Plotting the distribution of Product_Weight vs. Product_Type
type_vs_boxplot('Product_Weight')
- Nothing seems to be jumping out at us here. The distribution of weight across the Product Type classes seems to be pretty uniform besides the one outlier in Household, which we were able to isolate earlier.
- The majority of products fall within a product weight range of ~11 - 15 units.
Product_Type wrt. Allocated Area¶
#Plotting the distribution of Product_Allocated_Area v. Product_Type
type_vs_boxplot('Product_Allocated_Area')
- There are a great number of outliers within this distribution. Let's try to isolate the items with larger allocated areas.
# Isolating the items with Product_Allocated_Area of > 0.225
area_isolation_df = data[data['Product_Allocated_Area'] > 0.225]
area_isolation_df.shape
- There are 63 products with an allocated space larger than 0.225. Let's compare those items' statistical summary to the statistical summary without those items.
area_isolation_df.describe()
data[data['Product_Allocated_Area'] < 0.225].describe()
- There doesn't seem to be much difference in sales performance between the items given more space and those given less, except that the minimum values of Product_MRP and Product_Store_Sales_Total are drastically higher among the items with more space.
- Let's see if there is any relationship between Product_MRP and Allocation Area
# Plotting the Product_MRP of High Space Allocation Items
plt.figure(figsize=(14, 5))
sns.scatterplot(data=area_isolation_df, x='Product_Allocated_Area', y='Product_MRP', hue='Product_Allocated_Area')
plt.title('Relationship between Product MRP and Product Allocated Area')
plt.show()
# Plotting the Product_MRP of Low Space Allocation Items
plt.figure(figsize=(14, 5))
sns.scatterplot(data=data[data['Product_Allocated_Area'] < 0.225], x='Product_Allocated_Area', y='Product_MRP', hue='Product_Allocated_Area')
plt.title('Relationship between Product MRP and Product Allocated Area')
plt.show()
- Nothing really jumping out at me here. Let's try looking at the number of transactions of items with high allocation space to determine if these are highly transactional items that sell high volume.
# Creating a dataframe to show the difference between the transaction totals in the different Product_Allocation_Area values
sales_ratio_df = pd.DataFrame()
sales_ratio_df['Transactions'] = data['Product_Store_Sales_Total'] / data['Product_MRP']
sales_ratio_df['Allocation_Area'] = data['Product_Allocated_Area']
sales_ratio_df['Allocation_Area'] = sales_ratio_df['Allocation_Area'].apply(lambda x: 'High' if x > 0.225 else 'Low')
sales_ratio_df.head()
# Plotting the transactions by Allocation Area
plt.figure(figsize=(8, 5))
sns.barplot(x='Allocation_Area', y='Transactions', data=sales_ratio_df)
plt.title('Transactions by Allocation Area')
plt.xlabel('Allocation Area')
plt.ylabel('Transactions')
plt.show()
- This variable is so noisy that I'm led to believe there may be something off. If Product_Allocated_Area is a fraction of store space, then the sum of all Product_Allocated_Area values within each store should equal 1. Let's check.
# Calculating the sum of Product_Allocated_Area by Store_Id
data.groupby('Store_Id')['Product_Allocated_Area'].sum()
- AHA! These values are not correct. They also don't align with the store sizes: OUT001 is a "high" size store, OUT002 is a "small" size store, and OUT003 and OUT004 are both "medium" size stores, yet the allocation areas neither sum to one nor reflect the size differences between the stores. To experiment, we may scale these values and see what effect that has on our model score. For now, let's create a scaled version of the variable.
# Creating a new variable - "Scaled_Product_Allocation" that divides the current "Product_Allocated_Area" by the sums
data['Scaled_Product_Allocation'] = data['Product_Allocated_Area'] / data.groupby('Store_Id')['Product_Allocated_Area'].transform('sum')
# Checking that the sum of 'Scaled_Product_Allocation' grouped by store = 1
data.groupby('Store_Id')['Scaled_Product_Allocation'].sum()
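Rather than eyeballing the printed sums, the same check can be wrapped in an assertion so it fails loudly if the scaling ever breaks. A sketch on toy data (the notebook version would group `data` by `Store_Id`):

```python
import numpy as np
import pandas as pd

# Toy stand-in: scaled allocations within two hypothetical stores
df = pd.DataFrame({
    "Store_Id": ["OUT001", "OUT001", "OUT002", "OUT002", "OUT002"],
    "Scaled_Product_Allocation": [0.4, 0.6, 0.2, 0.3, 0.5],
})

sums = df.groupby("Store_Id")["Scaled_Product_Allocation"].sum()
assert np.allclose(sums, 1.0), "per-store allocations should sum to 1"
print("all per-store sums are 1")
```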
- Perfect. Now let's plot some of this new category to see if we can derive any insights.
# Plotting the distribution of 'Scaled_Product_Allocation'
histogram_boxplot(data, "Scaled_Product_Allocation")
- This new variable's distribution is obviously similar to the original distribution of Product_Allocated_Area, but doesn't seem as noisy and has a smoother distribution curve that may help our model make better sense of it. We'll experiment with it later.
# Plotting 'Scaled Product Allocation' wrt. Product Type
type_vs_boxplot('Scaled_Product_Allocation')
- Scaling it has produced WAY more outliers, but since we're going with tree-based models, I'm hoping they will be robust against these outliers.
# Let's check for correlations with our new variable
# Adding my numeric variables to a list
num_vars = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
# Plotting the heatmap
plt.figure(figsize=(10, 7.5))
sns.heatmap(data[num_vars].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
- Obviously, since we scaled the Allocated Area by store, it produced some correlation to store age, since the store age is specific to each store.
Product_Weight wrt. Product_Sugar_Content¶
# Plotting the distribution Product_Weight wrt. Product_Sugar_Content
plt.figure(figsize=(14, 8))
sns.boxplot(data=data, x='Product_Sugar_Content', y='Product_Weight', hue='Product_Sugar_Content')
plt.xticks(rotation=90)
plt.title('Distribution of Product Weight by Product Sugar Content')
plt.xlabel('Product Sugar Content')
plt.ylabel('Product Weight')
plt.show()
- Nothing remarkable jumping out here as the distribution of product weight is fairly similar among the different sugar contents.
Product_MRP by Product_Weight¶
# Plotting the relationship between Product_MRP and Product_Weight
plt.figure(figsize=(14, 5))
sns.scatterplot(data=data, x='Product_Weight', y='Product_MRP', hue='Product_MRP')
plt.title('Relationship between Product MRP and Product Weight')
plt.xlabel('Product Weight')
plt.ylabel('Product MRP')
plt.show()
- We can see a positive linear relationship here, which is not surprising given that Product_Weight and Product_MRP were moderately correlated in the correlation matrix.
Product_Type by Product_Sugar_Content¶
# Plotting the count of Product_Types by each Product_Sugar_Content category
plt.figure(figsize=(14, 6))
sns.heatmap(pd.crosstab(data['Product_Sugar_Content'], data['Product_Type']), annot=True, fmt='d', cmap='Blues')
plt.title('Count of Product Types by Product Sugar Content')
plt.xlabel('Product Type')
plt.ylabel('Product Sugar Content')
plt.show()
- Not surprising that Health and Hygiene, Household, and Others are the only contributors to the 'No Sugar' category.
- I find it surprising that Snack Foods has so many Low Sugar items when they are notorious for high sugar content.
- However, this does explain why Low Sugar was the highest contributor to product sales as well as Fruits and Vegetables and Snack Foods being high contributors when they are all aligned.
- Also of note, it seems quite obvious that "Regular" and "reg" are the same category: they share the same empty product types and similar value ratios across categories, with "reg" being much smaller. We'll combine these two later on in Data Preprocessing.
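The merge planned for Data Preprocessing is a one-line `replace`; a sketch on stand-in data (in the notebook this would be applied to `data["Product_Sugar_Content"]`):

```python
import pandas as pd

# Toy stand-in for data["Product_Sugar_Content"] with the duplicated label
s = pd.Series(["Low Sugar", "reg", "Regular", "No Sugar", "reg"])

# Collapse "reg" into "Regular"
cleaned = s.replace({"reg": "Regular"})
print(cleaned.nunique(), sorted(cleaned.unique()))  # → 3 ['Low Sugar', 'No Sugar', 'Regular']
```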
Product_Type by Store_Id¶
# Plotting the count of Product_Types sold in each Store_Id
#Computing the percentages
cross_tab = pd.crosstab(data['Store_Id'], data['Product_Type'], normalize='index') * 100
# Plotting the heatmap
plt.figure(figsize=(14, 6))
sns.heatmap(cross_tab, annot=True, fmt='.1f', cmap='cividis')
plt.title('% of Product Types by Store')
plt.xlabel('Product Type')
plt.ylabel('Store Id')
plt.show()
- Again, Fruits and Vegetables and Snack Foods are the highest two categories for all store types.
- Interesting to note that the best performing store (OUT004) sells the lowest ratio of dairy to other products.
Product_Type wrt. Product_MRP¶
# Plotting the distribution of Product_MRP wrt. Product_Type
type_vs_boxplot('Product_MRP')
- Starchy foods seem to have the highest median Product_MRP
- Meat shows the highest 75% range of Product_MRP
- Household has the highest outlier by Product_MRP that we isolated earlier
Product_MRP by Store_Id¶
# Plotting the Product_MRP by Store_Id
plt.figure(figsize=(14, 8))
sns.boxplot(data=data, x='Store_Id', y='Product_MRP', hue='Store_Id')
plt.xticks(rotation=90)
plt.title('Distribution of Product MRP by Store Id')
plt.xlabel('Store Id')
plt.ylabel('Product MRP')
plt.show()
- This is informative! OUT003 has the highest Product_MRP, followed by OUT001, OUT004, and OUT002.
- While the order of OUT003, OUT001, and OUT002 aligns with their overall contributions to Product_Store_Sales_Total, OUT004 is much lower than I would have expected given that they are the top contributor to Product_Store_Sales_Total.
- This could be a function of their locations or may have other contributing factors. Let's take a look.
Store Deep Dive¶
OUT001 Deep Dive¶
# Getting the statistical summary for OUT001
data.loc[data['Store_Id'] == 'OUT001'].describe(include='all')
- OUT001 carries a total of 1586 products with Snack Foods being the most common item carried and an average Product_MRP of 160.5, with prices ranging from 71 to 226.
- It is their oldest and largest store, built in 1987: a high-sized, Supermarket Type1 store located in a Tier 2 city.
- Product revenue ranges from 2300 to 4997 with an average product revenue of ~3923.
# Calculating total sales from OUT001
OUT001_Total_Sales = data.loc[data['Store_Id'] == 'OUT001']['Product_Store_Sales_Total'].sum()
print(OUT001_Total_Sales)
- OUT001 contributed 6.2M in total sales
# Plotting Product_Store_Sales by Product_Type for store OUT001
OUT001_df = (data.loc[data['Store_Id'] == 'OUT001'].groupby(['Product_Type'], as_index=False)['Product_Store_Sales_Total'].sum())
plt.figure(figsize=(14, 5))
sns.barplot(data=OUT001_df, x='Product_Type', y='Product_Store_Sales_Total')
plt.xticks(rotation=90)
plt.title('Product Store Sales by Product Type for OUT001')
plt.xlabel('Product Type')
plt.ylabel('Product Store Sales')
plt.show()
- OUT001 gets the most revenue from Fruits and Vegetables and Snack Foods (in line with the total dataset). Both contribute around 800,000 revenue each.
- OUT001 struggles with Breakfast and Seafood categories, neither topping 100,000 in revenue.
OUT002 Deep Dive¶
# Getting the statistical summary for OUT002
data.loc[data['Store_Id'] == 'OUT002'].describe(include='all')
- OUT002 carries a total of 1152 products with Fruits and Vegetables being the most common item carried and an average Product_MRP of 107, with prices ranging from 31 to 224.
- It is their second oldest and smallest store, built in 1998, and is a small-sized, Food Mart in a Tier 3 city (which explains why it is the poorest performing store).
- Product revenue ranges from 33 to 2299 with an average product revenue of ~1762.
# Calculating the OUT002 sales total
OUT002_Total_Sales = data.loc[data['Store_Id'] == 'OUT002']['Product_Store_Sales_Total'].sum()
print(OUT002_Total_Sales)
- OUT002 contributed 2.0M in total sales
While we're at it, let's find that minimum-priced item, the one that contributed only 33 to OUT002's Product_Store_Sales_Total.
data.loc[data['Product_MRP'] == 31]
- Okay, that's weird. I would have sworn this was the item that only contributed to a Product_Store_Sales_Total of 33. Let's find that particular product.
data.loc[data['Product_Store_Sales_Total'] == 33]
- It's not surprising that the lowest contributing product is in store OUT002. If the Product_MRP is 75.44, but it only contributed to 33 in Product_Store_Sales_Total, they must have had to sell it on clearance or had returns on the product. Hopefully, we're no longer selling it at this location.
- Learning from our "power item" investigation, let's see if there are other items in the dataset where the Product_Store_Sales_Total is less than the Product_MRP.
# Creating a dataframe of products in which Product_Store_Sales_Total is less than Product_MRP
return_risk_df = data[data['Product_Store_Sales_Total'] < data['Product_MRP']]
return_risk_df
- It seems this is probably either a typo or an extreme outlier. Let's drop it from the dataset in Outlier Detection.
# Plotting Product_Store_Sales by Product_Type for store OUT002
OUT002_df = (data.loc[data['Store_Id'] == 'OUT002'].groupby(['Product_Type'], as_index=False)['Product_Store_Sales_Total'].sum())
plt.figure(figsize=(14, 5))
sns.barplot(data=OUT002_df, x='Product_Type', y='Product_Store_Sales_Total')
plt.xticks(rotation=90)
plt.title('Product Store Sales by Product Type for OUT002')
plt.xlabel('Product Type')
plt.ylabel('Product Store Sales')
plt.show()
- Also in line with the total dataset, OUT002 gets most of its revenue from Fruits and Vegetables and Snack Foods, but at much lower revenues of ~300,000 and ~250,000 respectively.
- OUT002 struggles with Breads, Breakfast, Others, Seafood, and Starchy Foods - none topping 50,000 in revenue.
OUT003 Deep Dive¶
# Getting the statistical summary for OUT003
data.loc[data['Store_Id'] == 'OUT003'].describe(include='all')
- OUT003 carries a total of 1349 products with Snack Foods being the most common and an average Product_MRP of 181 and prices ranging from 85 to 266.
- It is their second newest store, built in 1999, and is a medium-sized, Departmental Store in a Tier 1 city.
- Product revenue ranges from 3069 to 8000 with an average product revenue of 4946. This is a marked difference from their other stores.
- It should be noted that this is their highest priced store.
# Calculating the total sales of OUT003
OUT003_Total_Sales = data.loc[data['Store_Id'] == 'OUT003']['Product_Store_Sales_Total'].sum()
print(OUT003_Total_Sales)
- OUT003 contributed 6.6M to total sales
# Plotting Product_Store_Sales by Product_Type for store OUT003
OUT003_df = (data.loc[data['Store_Id'] == 'OUT003'].groupby(['Product_Type'], as_index=False)['Product_Store_Sales_Total'].sum())
plt.figure(figsize=(14, 5))
sns.barplot(data=OUT003_df, x='Product_Type', y='Product_Store_Sales_Total')
plt.xticks(rotation=90)
plt.title('Product Store Sales by Product Type for OUT003')
plt.xlabel('Product Type')
plt.ylabel('Product Store Sales')
plt.show()
- Fruits and Vegetables and Snack Foods are again the highest contributors to sales, but at OUT003, Snack Foods is the top contributor rather than F&V. Both contribute well over 800,000 each.
- Dairy performs a little better here than at the other stores.
- Like the others, OUT003 struggles with Breakfast and Seafood as the lowest two contributors.
OUT004 Deep Dive¶
# Getting the statistical summary for OUT004
data.loc[data['Store_Id'] == 'OUT004'].describe(include='all')
- OUT004 carries a total of 4676 products with Fruits and Vegetables being the most common with an average Product_MRP of 142 and prices ranging from 83 to 197.
- It is their newest store, built in 2009, and is a medium-sized, Supermarket Type 2 store in a Tier 2 city.
- Product Revenue ranges from 1561 to 5462 with an average product revenue of ~3300.
- It should be noted that this is their highest volume store.
# Calculating OUT004 total sales
OUT004_Total_Sales = data.loc[data['Store_Id'] == 'OUT004']['Product_Store_Sales_Total'].sum()
print(OUT004_Total_Sales)
- OUT004 contributed 15.4M to total sales
# Plotting Product_Store_Sales by Product_Type for store OUT004
OUT004_df = (data.loc[data['Store_Id'] == 'OUT004'].groupby(['Product_Type'], as_index=False)['Product_Store_Sales_Total'].sum())
plt.figure(figsize=(14, 5))
sns.barplot(data=OUT004_df, x='Product_Type', y='Product_Store_Sales_Total')
plt.xticks(rotation=90)
plt.title('Product Store Sales by Product Type for OUT004')
plt.xlabel('Product Type')
plt.ylabel('Product Store Sales')
plt.show()
- Like all of the other stores, Fruits and Vegetables and Snack Foods are the largest contributors here, at close to or well over 2M each.
- Unlike the others, Frozen Foods is the third largest contributor rather than Dairy.
- OUT004 struggles with Breakfast, Others, and Seafood as the lowest contributors.
# Comparing the total sales by store
store_sales_df = pd.DataFrame([OUT001_Total_Sales, OUT002_Total_Sales, OUT003_Total_Sales, OUT004_Total_Sales], index=['OUT001', 'OUT002', 'OUT003', 'OUT004'], columns=['Product_Store_Sales_Total'])
store_sales_df
City Tier Deep Dive¶
Since deep diving into the Store Id was so informative, let's do the same with City Tier.
Tier 1¶
data.loc[data['Store_Location_City_Type'] == 'Tier 1'].describe(include='all')
- There are a total of 1349 products sold in our Tier 1 city, with Snack Foods being the most common; the average Product_MRP is 181, with prices ranging from ~85 - 266.
- There is only one store type and size in this city - Departmental Store of Medium size.
- Product revenue ranges from ~3069 - 8000 with an average product revenue of 4946.
# Calculating the total sales from Tier 1 city
Tier1_Total_Sales = data.loc[data['Store_Location_City_Type'] == 'Tier 1']['Product_Store_Sales_Total'].sum()
print(Tier1_Total_Sales)
- Tier 1 accounts for 6.6M of total sales
Tier 2¶
data.loc[data['Store_Location_City_Type'] == 'Tier 2'].describe(include='all')
- There are a total of 6262 products sold in our Tier 2 cities, with Fruits and Vegetables being the most common; the average Product_MRP is 146, with prices ranging from ~71 - 226. Slightly lower priced than Tier 1.
- There are two stores in this tier.
- Product revenue ranges from ~1561 - 5462 (spanning OUT001 and OUT004) with an average product revenue of ~3457 - quite different from Tier 1.
# Calculating the total sales of Tier 2
Tier2_Total_Sales = data.loc[data['Store_Location_City_Type'] == 'Tier 2']['Product_Store_Sales_Total'].sum()
print(Tier2_Total_Sales)
- Tier 2 accounted for 21.6M of total sales
Tier 3¶
data.loc[data['Store_Location_City_Type'] == 'Tier 3'].describe(include='all')
- There are a total of 1152 products sold in our Tier 3 city, with Fruits and Vegetables being the most common; the average Product_MRP is 107, with prices ranging from ~31 - 224. This is our lowest-priced store.
- There is one store in this Tier - a small Food Mart.
- Product Revenue ranges from ~33 - 2299 with an average product revenue of ~1762. This is by far our lowest performing City Tier.
# Calculating Tier 3 total sales
Tier3_Total_Sales = data.loc[data['Store_Location_City_Type'] == 'Tier 3']['Product_Store_Sales_Total'].sum()
print(Tier3_Total_Sales)
- Tier 3 accounted for 2.0M of total sales
# Comparing total sales by city tier
city_tier_sales_df = pd.DataFrame([Tier1_Total_Sales, Tier2_Total_Sales, Tier3_Total_Sales], index=['Tier 1', 'Tier 2', 'Tier 3'], columns=['Product_Store_Sales_Total'])
city_tier_sales_df
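The tier shares implied by these totals can be sanity-checked with plain Python. The figures below are rounded stand-ins for the Tier*_Total_Sales values computed above, not exact outputs:

```python
# Rounded stand-ins for the tier totals computed above (illustrative only)
tier_totals = {"Tier 1": 6.6e6, "Tier 2": 21.6e6, "Tier 3": 2.0e6}

grand_total = sum(tier_totals.values())

# Each tier's percentage share of total sales
tier_shares = {tier: round(100 * total / grand_total, 1)
               for tier, total in tier_totals.items()}
print(tier_shares)  # Tier 2 carries roughly 70% of revenue
```

Even a rough cut like this makes clear how lopsided the tiers are: the two Tier 2 stores account for the large majority of revenue.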
Data Preprocessing¶
Ordinal Encoding¶
Store_Location_City_Type¶
Given the marked differences in average product performance and pricing across Store_Location_City_Type, I'm going to give that variable an ordinal encoding that may help capture its importance a little better, mapping each tier to its number (Tier 1 → 1, Tier 2 → 2, Tier 3 → 3).
# Mapping the Store_Location_City_Type to ordinal values
city_type_map = {'Tier 1':1, 'Tier 2': 2, 'Tier 3': 3}
data['Store_Location_City_Type'] = data['Store_Location_City_Type'].map(city_type_map)
data.head()
Product_Sugar_Content Cleanup¶
We noticed earlier that there are two values within Product_Sugar_Content that are quite similar, "Regular" and "reg". We'll combine those now.
# Replacing "reg" in Product_Sugar_Content with "Regular"
data['Product_Sugar_Content'] = data['Product_Sugar_Content'].replace('reg', 'Regular')
# Checking our work by displaying the values counts for each category
data['Product_Sugar_Content'].value_counts()
Product_Type Cleanup¶
As we noted earlier, some categories are high-performing, some are mid-performing, and some are low-performing. To reduce the cardinality of Product_Type, we're going to group the types into these performance categories.
# As a reminder, let's list the unique values to ensure we don't miss any when grouping them
data['Product_Type'].value_counts()
# Creating the performance categories
High_performing = ['Fruits and Vegetables', 'Snack Foods']
Mid_performing = ['Baking Goods', 'Canned', 'Dairy', 'Frozen Foods', 'Health and Hygiene', 'Household', 'Meat', 'Soft Drinks']
Low_performing = ['Breads', 'Breakfast', 'Hard Drinks', 'Others', 'Seafood', 'Starchy Foods']
# Replacing the values in the Product_Type column with their respective new values
data['Product_Type'] = data['Product_Type'].replace(High_performing, 'High Performing')
data['Product_Type'] = data['Product_Type'].replace(Mid_performing, 'Mid Performing')
data['Product_Type'] = data['Product_Type'].replace(Low_performing, 'Low Performing')
data['Product_Type'].value_counts()
Feature Engineering - Power Items¶
Beyond categorizing the Product_Types into performance categories, let's also add a column that will identify our power items.
# Adding a column to our dataframe that identifies the power items from our power_item_df
data['Power_Item'] = data['Product_Id'].isin(power_item_df['Product_Id']).astype(int)
# Checking our work
data['Power_Item'].value_counts()
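The isin(...).astype(int) pattern used above is worth a quick illustration on its own. This is a synthetic sketch - the Product_Id values are made up, and power_item_demo stands in for the real power_item_df:

```python
import pandas as pd

# Synthetic stand-ins for the real dataframes (Product_Id values are made up)
data_demo = pd.DataFrame({"Product_Id": ["FD001", "FD002", "DR003", "FD004"]})
power_item_demo = pd.DataFrame({"Product_Id": ["FD002", "FD004"]})

# isin() yields a boolean Series; astype(int) turns it into a 0/1 flag column
data_demo["Power_Item"] = (
    data_demo["Product_Id"].isin(power_item_demo["Product_Id"]).astype(int)
)
print(data_demo["Power_Item"].tolist())  # → [0, 1, 0, 1]
```

The 0/1 encoding also means the flag can go straight into the model without any further preprocessing.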
Outlier Check¶
# Checking for outliers using boxplots
num_columns = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
num_columns.remove('Store_Establishment_Year')
plt.figure(figsize=(15, 12))
for i, column in enumerate(num_columns):
    plt.subplot(6, 6, i + 1)
    plt.boxplot(data[column], whis=1.5)
    plt.title(column)
plt.tight_layout()
plt.show()
- Though we have multiple outliers in all of our numeric variables, this looks pretty realistic for a retail environment where prices, product sizes, and allocation area can vary drastically.
- However, since we identified only a single product in the dataset that had a Product_MRP that was higher than Product_Store_Sales_Total, we're going to remove that single outlier.
# Removing our one outlier from the dataset where Product_MRP > Product_Store_Sales_Total
data = data[data['Product_MRP'] < data['Product_Store_Sales_Total']]
data.shape
Dropping Redundant Variables¶
- We have some redundant variables in our dataset. Specifically, Store_Establishment_Year and Store_Type each take a single value per store, so they are fully determined by Store_Id. We're going to drop both and keep Store_Id.
- We'll also need to drop Product_Id since it is unique to each product.
# Dropping Store_Establishment_Year and Store_Type from the dataset
data.drop(['Store_Establishment_Year', 'Store_Type'], axis=1, inplace=True)
# Dropping Product_Id
data.drop('Product_Id', axis=1, inplace=True)
# Checking our work and taking another look at the head
data.head()
Splitting the Dataset¶
- We're going to do three tests based on the uncertainty in the noise attributed to Product_Allocated_Area:
- We're going to test the model with Product_Allocated_Area as part of the data
- We're going to test the model with Scaled_Product_Allocation as part of the data
- We're going to test the model with both Product_Allocated_Area and Scaled_Product_Allocation dropped
# Separating our features and our target variable into X and y respectively
X_experiment1 = data.drop(['Product_Store_Sales_Total', 'Scaled_Product_Allocation'], axis=1)
X_experiment2 = data.drop(['Product_Store_Sales_Total', 'Product_Allocated_Area'], axis=1)
X_experiment3 = data.drop(['Product_Store_Sales_Total', 'Scaled_Product_Allocation', 'Product_Allocated_Area'], axis=1)
y = data['Product_Store_Sales_Total']
# Splitting the dataset into train and test in an 80/20 ratio
X_exp1_train, X_exp1_test, y_train, y_test = train_test_split(X_experiment1, y, test_size=0.2, random_state=42, shuffle=True)
X_exp2_train, X_exp2_test, y_train, y_test = train_test_split(X_experiment2, y, test_size=0.2, random_state=42, shuffle=True)
X_exp3_train, X_exp3_test, y_train, y_test = train_test_split(X_experiment3, y, test_size=0.2, random_state=42, shuffle=True)
# Checking the shapes of our training and test sets
X_exp1_train.shape, X_exp1_test.shape, X_exp2_train.shape, X_exp2_test.shape, X_exp3_train.shape, X_exp3_test.shape
X_exp1_train.head()
X_exp2_train.head()
X_exp3_train.head()
Data Pre-Processing and Pipeline¶
- Since the numerical features differ between the experiment sets, we'll redefine the numeric feature list just before building each model pipeline.
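One way to cut down that repetition is a small helper that derives the feature lists from whichever training frame it's given. A sketch, assuming the sklearn imports already used in this notebook (build_preprocessor is a name I'm introducing here, not something defined earlier):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

def build_preprocessor(X: pd.DataFrame):
    """Build a one-hot/passthrough transformer from whatever columns X has."""
    cat_features = X.select_dtypes(include=("object", "category")).columns.tolist()
    num_features = X.select_dtypes(include=("int64", "float64")).columns.tolist()
    return make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), cat_features),
        ("passthrough", num_features),
    )

# Tiny synthetic frame to show the resulting shape (values are made up)
demo = pd.DataFrame({"Product_Type": ["High Performing", "Low Performing"],
                     "Product_MRP": [160.5, 107.0]})
pre = build_preprocessor(demo)
transformed = pre.fit_transform(demo)
print(transformed.shape)  # 2 rows, 2 one-hot columns + 1 numeric column
```

Calling build_preprocessor(X_exp1_train), build_preprocessor(X_exp2_train), etc. would then replace the repeated cat_features/num_features/make_column_transformer blocks below.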
Model Building¶
Define functions for Model Evaluation¶
- We're going to build, tune, and test 2 models on the data and observe their performance.
- Since this dataset is full of outliers and noise, we're going with Random Forest for our baseline model and XGBoost as what we hope to be our primary model.
- We will use R-squared as our evaluation metric: it tells us how much of the sales variation the model explains and serves as a good overall estimate of model fit.
- We'll tune hyperparameters using GridSearchCV, optimizing on the R-squared score.
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]  # number of observations
    k = predictors.shape[1]  # number of predictors
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
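A quick worked example of the formula, with illustrative numbers rather than values from this dataset:

```python
# Worked example of adjusted R-squared with illustrative numbers
r2, n, k = 0.95, 100, 5  # plain R-squared, observations, predictors

adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
print(round(adj_r2, 4))  # → 0.9473, slightly below the plain R-squared
```

The penalty grows with k, so adding uninformative predictors drags adjusted R-squared down even when plain R-squared ticks up.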
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mean_absolute_percentage_error(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf
The ML models to be built can be any two out of the following:
- Decision Tree
- Bagging
- Random Forest
- AdaBoost
- Gradient Boosting
- XGBoost
Random Forest¶
# Defining our categorical features to a list
cat_features = data.select_dtypes(include=('object', 'category')).columns.tolist()
print(cat_features)
# Defining our numerical features to a list
num_features = X_exp1_train.select_dtypes(include=('int64', 'float64')).columns.tolist()
print(num_features)
# Create a preprocessing pipeline for the features
preprocessor = make_column_transformer(
(Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), cat_features), ('passthrough', num_features)
)
# Initiating an instance of Random Forest model
rf_model = RandomForestRegressor(random_state=42)
# Preprocessing the data for the model
rf_model = make_pipeline(preprocessor, rf_model)
# Fitting the data to the model and training
rf_model.fit(X_exp1_train, y_train)
# Checking the model performance on the training set
rf_model_train_perf = model_performance_regression(rf_model, X_exp1_train, y_train)
rf_model_train_perf
# Checking the model performance on the test set
rf_model_test_perf = model_performance_regression(rf_model, X_exp1_test, y_test)
rf_model_test_perf
- WOW! Very nice starting point!
- Our RF model explains about 95.6% of the variance in the data, is off by only ~3% on predictions (MAPE), and misses by an average of ~79 in sales revenue per product with Product_Allocated_Area included in the dataset.
# Defining our categorical features to a list
cat_features = data.select_dtypes(include=('object', 'category')).columns.tolist()
print(cat_features)
# Defining our numerical features to a list
num_features = X_exp2_train.select_dtypes(include=('int64', 'float64')).columns.tolist()
print(num_features)
# Create a preprocessing pipeline for the features
preprocessor = make_column_transformer(
(Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), cat_features), ('passthrough', num_features)
)
# Initiating an instance of Random Forest model
rf_model2 = RandomForestRegressor(random_state=42)
# Preprocessing the data for the model
rf_model2 = make_pipeline(preprocessor, rf_model2)
# Fitting the data to the model and training
rf_model2.fit(X_exp2_train, y_train)
# Checking the model performance on the training set
rf_model2_train_perf = model_performance_regression(rf_model2, X_exp2_train, y_train)
rf_model2_train_perf
# Checking the model performance on the test set
rf_model2_test_perf = model_performance_regression(rf_model2, X_exp2_test, y_test)
rf_model2_test_perf
- By removing Product_Allocated_Area and replacing it with the Scaled_Product_Allocation variable, we got just a tiny bit of improvement in the model.
- R-squared improved by ~0.0003%, and MAE improved by about 0.05 per product on average.
# Defining our categorical features to a list
cat_features = data.select_dtypes(include=('object', 'category')).columns.tolist()
print(cat_features)
# Defining our numerical features to a list
num_features = X_exp3_train.select_dtypes(include=('int64', 'float64')).columns.tolist()
print(num_features)
# Create a preprocessing pipeline for the features
preprocessor = make_column_transformer(
(Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), cat_features), ('passthrough', num_features)
)
# Initiating an instance of Random Forest model
rf_model3 = RandomForestRegressor(random_state=42)
# Preprocessing the data for the model
rf_model3 = make_pipeline(preprocessor, rf_model3)
# Fitting the data to the model and training
rf_model3.fit(X_exp3_train, y_train)
# Checking the model performance on the training set
rf_model3_train_perf = model_performance_regression(rf_model3, X_exp3_train, y_train)
rf_model3_train_perf
# Checking the model performance on the test set
rf_model3_test_perf = model_performance_regression(rf_model3, X_exp3_test, y_test)
rf_model3_test_perf
# Creating a dataframe to compare all of the test metrics of all 3 of our RF models
rf_performance_df = pd.concat([rf_model_test_perf, rf_model2_test_perf, rf_model3_test_perf], ignore_index=True)
rf_performance_df.index = ['Product_Allocated_Area', 'Scaled_Product_Allocation', 'No Allocation']
print("Random Forest Test Performance Comparison")
rf_performance_df
- Removing both variables, Product_Allocated_Area and Scaled_Product_Allocation, actually hurt our performance with RF.
- But in the name of science, and since we've already done the work, let's continue testing with all three sets.
XGBoost¶
# Defining our categorical features to a list
cat_features = data.select_dtypes(include=('object', 'category')).columns.tolist()
print(cat_features)
# Defining our numerical features to a list
num_features = X_exp1_train.select_dtypes(include=('int64', 'float64')).columns.tolist()
print(num_features)
# Create a preprocessing pipeline for the features
preprocessor = make_column_transformer(
(Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), cat_features), ('passthrough', num_features)
)
# Initiating an instance of the XGBoost model
xgb_model = XGBRegressor(random_state=42)
# Preprocessing the data for the model
xgb_model = make_pipeline(preprocessor, xgb_model)
# Fitting the data to the model and training
xgb_model.fit(X_exp1_train, y_train)
# Checking the model performance on the training set
xgb_model_train_perf = model_performance_regression(xgb_model, X_exp1_train, y_train)
xgb_model_train_perf
# Checking the model performance on the test set
xgb_model_test_perf = model_performance_regression(xgb_model, X_exp1_test, y_test)
xgb_model_test_perf
- With Product_Allocated_Area included, our XGB model explains about 95% of the variance, is off by about 3% on predictions (MAPE), and misses by an average of ~97 in sales revenue per product.
# Defining our categorical features to a list
cat_features = data.select_dtypes(include=('object', 'category')).columns.tolist()
print(cat_features)
# Defining our numerical features to a list
num_features = X_exp2_train.select_dtypes(include=('int64', 'float64')).columns.tolist()
print(num_features)
# Create a preprocessing pipeline for the features
preprocessor = make_column_transformer(
(Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), cat_features), ('passthrough', num_features)
)
# Initiating an instance of the XGBoost model
xgb_model2 = XGBRegressor(random_state=42)
# Preprocessing the data for the model
xgb_model2 = make_pipeline(preprocessor, xgb_model2)
# Fitting the data to the model and training
xgb_model2.fit(X_exp2_train, y_train)
# Checking the model performance on the training set
xgb_model2_train_perf = model_performance_regression(xgb_model2, X_exp2_train, y_train)
xgb_model2_train_perf
# Checking the model performance on the test set
xgb_model2_test_perf = model_performance_regression(xgb_model2, X_exp2_test, y_test)
xgb_model2_test_perf
- It seems including Scaled_Product_Allocation instead of Product_Allocated_Area hurt our XGB model's performance by the slimmest of margins - exactly the opposite of RF.
# Defining our categorical features to a list
cat_features = data.select_dtypes(include=('object', 'category')).columns.tolist()
print(cat_features)
# Defining our numerical features to a list
num_features = X_exp3_train.select_dtypes(include=('int64', 'float64')).columns.tolist()
print(num_features)
# Create a preprocessing pipeline for the features
preprocessor = make_column_transformer(
(Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), cat_features), ('passthrough', num_features)
)
# Initiating an instance of the XGBoost model
xgb_model3 = XGBRegressor(random_state=42)
# Preprocessing the data for the model
xgb_model3 = make_pipeline(preprocessor, xgb_model3)
# Fitting the data to the model and training
xgb_model3.fit(X_exp3_train, y_train)
# Checking the model performance on the training set
xgb_model3_train_perf = model_performance_regression(xgb_model3, X_exp3_train, y_train)
xgb_model3_train_perf
# Checking the model performance on the test set
xgb_model3_test_perf = model_performance_regression(xgb_model3, X_exp3_test, y_test)
xgb_model3_test_perf
# Creating a dataframe of all of the metrics of all of our XGB models for comparison
xgb_performance_df = pd.concat([xgb_model_test_perf, xgb_model2_test_perf, xgb_model3_test_perf], ignore_index=True)
xgb_performance_df.index = ['Product_Allocated_Area', 'Scaled_Product_Allocation', 'No Allocation']
print("XGBoost Test Performance Comparison")
xgb_performance_df
- How surprising! Training without either product-allocation variable produced the lowest performance for our RF model but the highest performance for our XGB model.
Model Performance Improvement - Hyperparameter Tuning¶
- Going forward, we'll use the Scaled_Product_Allocation dataset (X_exp2) for Random Forest tuning and the 'No Allocation' dataset (X_exp3) for XGBoost tuning.
Random Forest Tuned¶
# Defining our categorical features to a list
cat_features = data.select_dtypes(include=('object', 'category')).columns.tolist()
print(cat_features)
# Defining our numerical features to a list
num_features = X_exp2_train.select_dtypes(include=('int64', 'float64')).columns.tolist()
print(num_features)
# Create a preprocessing pipeline for the features
preprocessor = make_column_transformer(
(Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), cat_features), ('passthrough', num_features)
)
# Initiating another instance of Random Forest to tune
rf_tuned2 = RandomForestRegressor(random_state=42)
# Preprocessing the data for the model
rf_tuned2 = make_pipeline(preprocessor, rf_tuned2)
# Defining our cross-validation strategy
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)
# Setting up the parameter grid
parameters = {
'randomforestregressor__n_estimators': [5, 10, 20, 30],
'randomforestregressor__max_features': [2, 3, 5, 7],
'randomforestregressor__max_depth' : [2, 3, 4, 5],
}
# Running the GridSearchCV
grid_obj = GridSearchCV(rf_tuned2, parameters, cv=cv_strategy, scoring='r2', n_jobs=-1)
grid_obj.fit(X_exp2_train, y_train)
# Setting the best parameters on our tuned model
rf_tuned2 = grid_obj.best_estimator_
# Fitting the data to the model
rf_tuned2.fit(X_exp2_train, y_train)
grid_obj.best_params_
# Checking the performance of rf_tuned2 on the training set
rf_tuned2_train_perf = model_performance_regression(rf_tuned2, X_exp2_train, y_train)
rf_tuned2_train_perf
# Checking the performance of rf_tuned2 on the test set
rf_tuned2_test_perf = model_performance_regression(rf_tuned2, X_exp2_test, y_test)
rf_tuned2_test_perf
- With Scaled_Product_Allocation, overfitting has been nearly eliminated but at the cost of performance as our tuned RF is down ~5% on R-squared score compared to the base model.
XGBoost Tuned¶
# Defining our categorical features to a list
cat_features = data.select_dtypes(include=('object', 'category')).columns.tolist()
print(cat_features)
# Defining our numerical features to a list
num_features = X_exp3_train.select_dtypes(include=('int64', 'float64')).columns.tolist()
print(num_features)
# Create a preprocessing pipeline for the features
preprocessor = make_column_transformer(
(Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), cat_features), ('passthrough', num_features)
)
# Initiating another instance of XGBoost
xgb_tuned = XGBRegressor(random_state=42)
# Preprocessing the data for the model
xgb_tuned = make_pipeline(preprocessor, xgb_tuned)
# Defining our cross-validation strategy
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)
# Setting the parameter grid
parameters = {
'xgbregressor__n_estimators': [10, 20, 30, 40, 50],
'xgbregressor__subsample': [0.6, 0.7, 0.8],
'xgbregressor__gamma': [0, 0.3, 0.5, 1, 3],
'xgbregressor__colsample_bytree': [0.6, 0.7, 0.8],
'xgbregressor__max_depth': [2, 3, 4, 5, 6],
'xgbregressor__colsample_bylevel': [0.6, 0.7, 0.8]
}
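Before running this, it's worth counting how many fits the grid implies (sizes read off the parameter dict above, combined with the 5-fold CV strategy):

```python
# Sizes of each hyperparameter list in the XGBoost grid above:
# n_estimators, subsample, gamma, colsample_bytree, max_depth, colsample_bylevel
grid_sizes = [5, 3, 5, 3, 5, 3]

# GridSearchCV tries every combination (the Cartesian product of the lists)
combos = 1
for size in grid_sizes:
    combos *= size

folds = 5  # matches KFold(n_splits=5)
print(combos, combos * folds)  # → 3375 16875
```

That's 16,875 model fits - n_jobs=-1 helps, but it's why this search takes noticeably longer than the 64-combination Random Forest grid.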
# Running the GridSearchCV
grid_obj = GridSearchCV(xgb_tuned, parameters, cv=cv_strategy, scoring='r2', n_jobs=-1)
grid_obj.fit(X_exp3_train, y_train)
# Setting the best parameters on our tuned model
xgb_tuned = grid_obj.best_estimator_
# Fitting the data to the model
xgb_tuned.fit(X_exp3_train, y_train)
# Listing the best parameters our xgb_tuned model found
grid_obj.best_params_
# Checking the model's performance on the training set
xgb_tuned_train_perf = model_performance_regression(xgb_tuned, X_exp3_train, y_train)
xgb_tuned_train_perf
# Checking the model's performance on the test set
xgb_tuned_test_perf = model_performance_regression(xgb_tuned, X_exp3_test, y_test)
xgb_tuned_test_perf
- Without either Product_Allocated_Area or Scaled_Product_Allocation, our tuned XGB model explained about 95.4% of the variance, was off by ~4% on predictions (MAPE), and missed by an average of ~97.6 in sales revenue per product.
# Creating a dataframe to compare the test metrics of the 4 best models so far
model_performance_matrix_df = pd.concat([rf_model2_test_perf, rf_tuned2_test_perf, xgb_model3_test_perf, xgb_tuned_test_perf], ignore_index=True)
model_performance_matrix_df.index = ['RF Scaled', 'RF Tuned Scaled', 'XGB No Allocation', 'XGB Tuned No Allocation']
print("Model Performance Comparison")
model_performance_matrix_df
- Our Random Forest Tuned model with the scaled product allocation performed best. Let's take a look at the feature importances and see where we may be able to further refine it.
# Plotting the feature importances from our Random Forest model trained on Scaled_Product_Allocation
plt.figure(figsize=(10, 6))
# Get feature names and importances
feature_names = rf_tuned2.named_steps['columntransformer'].get_feature_names_out()
importances = rf_tuned2.named_steps['randomforestregressor'].feature_importances_
# Create a pandas Series for easier sorting
feature_importances = pd.Series(importances, index=feature_names)
# Sort the features by importance in descending order
sorted_feature_importances = feature_importances.sort_values(ascending=True)
# Plot the sorted feature importances
plt.barh(sorted_feature_importances.index, sorted_feature_importances.values)
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Feature Importance Plot')
plt.show()
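One caveat on this plot: the one-hot encoder splits each categorical column into several dummy features, so a category's importance is spread thin across its dummies. A sketch of summing the dummies back to their source column - the feature names and importance values here are made up for illustration, not taken from the fitted model:

```python
import pandas as pd

# Hypothetical importances keyed by transformer-generated feature names
feature_importances = pd.Series({
    "onehotencoder__Product_Type_High Performing": 0.02,
    "onehotencoder__Product_Type_Mid Performing": 0.01,
    "onehotencoder__Product_Type_Low Performing": 0.01,
    "passthrough__Product_MRP": 0.60,
    "passthrough__Store_Id": 0.36,
})

def source_column(name: str) -> str:
    """Strip the transformer prefix and any one-hot category suffix."""
    name = name.split("__", 1)[-1]
    for col in ("Product_Type",):  # original categorical columns
        if name.startswith(col):
            return col
    return name

# Series.groupby with a callable groups by the mapped index values
grouped = feature_importances.groupby(source_column).sum()
print(grouped.sort_values(ascending=False))
```

Viewed this way, a categorical column can matter more than any single dummy suggests, which is worth checking before pruning features off this plot.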
- It seems the Sugar Content and the Feature Engineering we performed on the product categories may be taking away from the model's performance. Let's remove our feature engineering and the Sugar Content variable and run it back through the model.
# Reloading the data and making the necessary transformations
data = df.copy()
# Creating a new variable, Scaled_Product_Allocation, that divides Product_Allocated_Area by its per-store sum
data['Scaled_Product_Allocation'] = data['Product_Allocated_Area'] / data.groupby('Store_Id')['Product_Allocated_Area'].transform('sum')
# Mapping the Store_Location_City_Type to ordinal values
city_type_map = {'Tier 1':1, 'Tier 2': 2, 'Tier 3': 3}
data['Store_Location_City_Type'] = data['Store_Location_City_Type'].map(city_type_map)
# Adding a column to our dataframe that identifies the power items from our power_item_df
data['Power_Item'] = data['Product_Id'].isin(power_item_df['Product_Id']).astype(int)
# Removing our one outlier from the dataset where Product_MRP > Product_Store_Sales_Total
data = data[data['Product_MRP'] < data['Product_Store_Sales_Total']]
# Dropping non-essential variables
data.drop(['Store_Establishment_Year', 'Store_Type', 'Product_Id', 'Product_Allocated_Area', 'Product_Sugar_Content'], axis=1, inplace=True)
# Reviewing our changes
data.head()
# Splitting the new dataset into train and test
X = data.drop('Product_Store_Sales_Total', axis=1)
y = data['Product_Store_Sales_Total']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Defining our categorical features to a list
cat_features = X_train.select_dtypes(include=('object', 'category')).columns.tolist()
print(cat_features)
# Defining our numerical features to a list
num_features = X_train.select_dtypes(include=('int64', 'float64')).columns.tolist()
print(num_features)
# Create a preprocessing pipeline for the features
preprocessor = make_column_transformer(
(Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), cat_features), ('passthrough', num_features)
)
# Initializing a new instance of Random Forest
rf_final = RandomForestRegressor(random_state=42)
# Preprocessing the data for the model
rf_final = make_pipeline(preprocessor, rf_final)
# Fitting rf_final to the dataset
rf_final.fit(X_train, y_train)
# Getting the training performance for rf_final
rf_final_train_perf = model_performance_regression(rf_final, X_train, y_train)
rf_final_train_perf
# Getting the test performance for rf_final
rf_final_test_perf = model_performance_regression(rf_final, X_test, y_test)
rf_final_test_perf
- This is promising. Let's drop Scaled_Product_Allocation and run it through XGBoost again.
# Dropping 'Scaled_Product_Allocation' from X_train and X_test
X_train.drop('Scaled_Product_Allocation', axis=1, inplace=True)
X_test.drop('Scaled_Product_Allocation', axis=1, inplace=True)
# Collecting our categorical features into a list
cat_features = X_train.select_dtypes(include=('object', 'category')).columns.tolist()
print(cat_features)
# Collecting our numerical features into a list
num_features = X_train.select_dtypes(include=('int64', 'float64')).columns.tolist()
print(num_features)
# Create a preprocessing pipeline for the features
preprocessor = make_column_transformer(
(Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), cat_features), ('passthrough', num_features)
)
# Initializing a new instance of XGBoost with the latest dataset
xgb_final = XGBRegressor(random_state=42)
# Preprocessing the data for the model
xgb_final = make_pipeline(preprocessor, xgb_final)
# Fitting xgb_final to the dataset
xgb_final.fit(X_train, y_train)
# Getting the training performance for xgb_final
xgb_final_train_perf = model_performance_regression(xgb_final, X_train, y_train)
xgb_final_train_perf
# Getting the test performance for xgb_final
xgb_final_test_perf = model_performance_regression(xgb_final, X_test, y_test)
xgb_final_test_perf
- Not the performance I had hoped for.
- Let's take one last look at our best models and choose the winner.
Model Performance Comparison, Final Model Selection, and Serialization¶
Model Comparison and Final Model Selection¶
# Creating a dataframe to compare the test metrics of our six most recent models
final_model_performance_matrix_df = pd.concat([rf_tuned2_test_perf, xgb_final_test_perf, xgb_model3_test_perf, xgb_tuned_test_perf, rf_model2_test_perf, rf_final_test_perf], ignore_index=True)
final_model_performance_matrix_df.index = ['RF Tuned Scaled', 'XGB Final w/ Product_Type', 'XGB No Allocation', 'XGB Tuned No Allocation', 'RF Scaled', 'RF Final w/ Product_Type']
print("Model Performance Comparison")
final_model_performance_matrix_df
- Based on the Model Performance Comparison, we can safely eliminate all but RF Scaled and RF Final. Before picking a final winner, let's take one last look at how much each model is overfit.
# Creating a dataframe of training, test, and difference scores for rf_model2 and rf_final
training_scores = [rf_model2_train_perf['R-squared'][0], rf_final_train_perf['R-squared'][0]]
test_scores = [rf_model2_test_perf['R-squared'][0], rf_final_test_perf['R-squared'][0]]
difference_scores = [rf_model2_test_perf['R-squared'][0] - rf_model2_train_perf['R-squared'][0], rf_final_test_perf['R-squared'][0] - rf_final_train_perf['R-squared'][0]]
final_rf_perf_df = pd.DataFrame({'Training Score': training_scores, 'Test Score': test_scores, 'Difference': difference_scores}, index=['RF Scaled', 'RF Final'])
final_rf_perf_df
- Since RF Final is the least overfit of our best models, we'll move forward with RF Final as the winning model.
# Taking a peek at the feature importances of the final model
# Plotting the feature importances from our final Random Forest pipeline (trained with Scaled_Product_Allocation included)
plt.figure(figsize=(10, 6))
# Get feature names and importances
feature_names = rf_final.named_steps['columntransformer'].get_feature_names_out()
importances = rf_final.named_steps['randomforestregressor'].feature_importances_
# Create a pandas Series for easier sorting
feature_importances = pd.Series(importances, index=feature_names)
# Sort the features in ascending order so the most important appear at the top of the horizontal bar chart
sorted_feature_importances = feature_importances.sort_values(ascending=True)
# Plot the sorted feature importances
plt.barh(sorted_feature_importances.index, sorted_feature_importances.values)
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Feature Importance Plot')
plt.show()
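Impurity-based importances from tree ensembles can be biased toward high-cardinality features, so permutation importance on held-out data is a common cross-check. A sketch on synthetic data (with the real pipeline you would pass rf_final and X_test instead):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features, only 2 of which are informative
X, y = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the drop in R-squared
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean)
```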
- In all honesty, it may not have been worth the extra effort, but in the retail world, every penny counts!
Model Serialization¶
# Creating the backend file directory
os.makedirs('backend_files', exist_ok=True)
# Defining the file path to save the serialized model along with the preprocessing steps
saved_model_path = 'backend_files/sales_prediction_model_v1_0.joblib'
# Saving the best model pipeline using joblib
joblib.dump(rf_final, saved_model_path)
print(f"Model saved successfully at {saved_model_path}")
# Loading the saved model pipeline from the file
saved_model = joblib.load(saved_model_path)
print("Model loaded successfully")
saved_model
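To confirm the round trip preserves behavior, the same dump/load pattern can be sanity-checked by comparing predictions before and after serialization. A self-contained sketch on synthetic data (a stand-in for the real pipeline and file path):

```python
import os
import tempfile
import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real model; the pattern is what matters
X, y = make_regression(n_samples=100, n_features=4, random_state=42)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Dump to a temporary file and load it back
path = os.path.join(tempfile.mkdtemp(), 'model.joblib')
joblib.dump(model, path)
loaded = joblib.load(path)

# The deserialized model reproduces the original predictions exactly
assert np.allclose(model.predict(X), loaded.predict(X))
print("Round-trip predictions match")
```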
Making Predictions¶
We'll now make some predictions on the test set with our deserialized model.
- First, we need to recreate the dataset with the same transformations as before
# Reloading the data and making the necessary transformations
data = df.copy()
# Creating a new variable - "Scaled_Product_Allocation" - that divides each "Product_Allocated_Area" by the sum of allocated areas for its store
data['Scaled_Product_Allocation'] = data['Product_Allocated_Area'] / data.groupby('Store_Id')['Product_Allocated_Area'].transform('sum')
# Mapping the Store_Location_City_Type to ordinal values
city_type_map = {'Tier 1':1, 'Tier 2': 2, 'Tier 3': 3}
data['Store_Location_City_Type'] = data['Store_Location_City_Type'].map(city_type_map)
# Adding a column to our dataframe that identifies the power items from our power_item_df
data['Power_Item'] = data['Product_Id'].isin(power_item_df['Product_Id']).astype(int)
# Removing our one outlier from the dataset where Product_MRP > Product_Store_Sales_Total
data = data[data['Product_MRP'] < data['Product_Store_Sales_Total']]
# Dropping non-essential variables
data.drop(['Store_Establishment_Year', 'Store_Type', 'Product_Id', 'Product_Allocated_Area', 'Product_Sugar_Content'], axis=1, inplace=True)
# Splitting the new dataset into train and test
X = data.drop('Product_Store_Sales_Total', axis=1)
y = data['Product_Store_Sales_Total']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Predicting with the deserialized model on our test set
saved_model.predict(X_test)
Fantastic! Our model can be used to make predictions on new data now without the need to be retrained.
Deployment - Backend¶
Flask Web Framework¶
%%writefile backend_files/app.py
# Importing the necessary libraries
import numpy as np
import pandas as pd
import joblib
from flask import Flask, request, jsonify
# Initializing the Flask app
superkart_api = Flask("SuperKart Sales Predictor")
# Loading the trained model
model = joblib.load('sales_prediction_model_v1_0.joblib')
# Defining a route for the home page
@superkart_api.get('/')
def home():
return "Welcome to SuperKart Sales Predictor"
# Defining an endpoint to predict sales for a single product
@superkart_api.post('/v1/predict')
def predict_sales():
# Get JSON data from the request
data = request.get_json()
# Extract relevant product attributes from the input data.
sample = {
'Product_Weight': data['Product_Weight'],
'Product_Type': data['Product_Type'],
'Product_MRP': data['Product_MRP'],
'Store_Id': data['Store_Id'],
'Store_Size': data['Store_Size'],
'Store_Location_City_Type': data['Store_Location_City_Type'],
'Scaled_Product_Allocation': data['Scaled_Product_Allocation'],
'Power_Item': data['Power_Item'],
}
# Convert the extracted data to a DataFrame
input_data = pd.DataFrame([sample])
# Make a sales prediction using the trained model
prediction = model.predict(input_data).tolist()[0]
# Return the prediction as a JSON response
return jsonify({'Sales': prediction})
# Running the Flask app in debug mode
if __name__ == '__main__':
superkart_api.run(debug=True)
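The request/response contract for /v1/predict can be exercised with Flask's built-in test client. This sketch uses a stubbed prediction so it runs without the serialized model file; the field names mirror the app above, and the stub formula is purely illustrative:

```python
from flask import Flask, request, jsonify

# Minimal stub of the API above so the contract can be tested without the model file
stub_api = Flask("SuperKart Sales Predictor Stub")

@stub_api.route('/v1/predict', methods=['POST'])
def predict_sales():
    data = request.get_json()
    # Hypothetical placeholder in place of model.predict() - purely illustrative
    return jsonify({'Sales': 30.0 * data['Product_MRP']})

# Same field names the real endpoint expects
sample = {
    'Product_Weight': 12.66, 'Product_Type': 'Dairy', 'Product_MRP': 147.03,
    'Store_Id': 'OUT004', 'Store_Size': 'Medium', 'Store_Location_City_Type': 1,
    'Scaled_Product_Allocation': 0.0006, 'Power_Item': 0,
}

# Post the JSON payload and inspect the response
with stub_api.test_client() as client:
    resp = client.post('/v1/predict', json=sample)
    print(resp.status_code, resp.get_json())
```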
Dependencies File¶
%%writefile backend_files/requirements.txt
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
seaborn==0.13.2
joblib==1.4.2
xgboost==2.1.4
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.32.3
uvicorn[standard]
streamlit==1.43.2
Dockerfile¶
%%writefile backend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim
# Set the working directory inside the container
WORKDIR /app
# Copy all files from the current directory to the container's working directory
COPY . .
# Install dependencies from the requirements file without using cache to reduce image size
RUN pip install --no-cache-dir --upgrade -r requirements.txt
# Define the command to start the application using Gunicorn with 4 worker processes
# - `-w 4`: Uses 4 worker processes for handling requests
# - `-b 0.0.0.0:7860`: Binds the server to port 7860 on all network interfaces
# - `app:superkart_api`: Runs the Flask instance named `superkart_api` defined in `app.py`
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:superkart_api"]
Setting up a Hugging Face Docker Space for the Backend¶
import os
from google.colab import userdata
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
print("HF_TOKEN is in os.environ:", 'HF_TOKEN' in os.environ)
# Importing the login and create_repo functions and the HfApi class from the huggingface_hub library
from huggingface_hub import login, create_repo, HfApi
# Logging in to Hugging Face with your API token
login(token=os.environ['HF_TOKEN'])
# Trying to create the repo for a Hugging Face Space
try:
create_repo('SuperKart-Sales-Predictor-Backend', repo_type='space', space_sdk='docker', private=False) # Setting the repo as a docker space and making it public
print('Repo created successfully')
except Exception as e:
# Handling potential errors during repo creation
if 'RepositoryAlreadyExistsError' in str(e):
print("Repo already exists")
else:
print(f'Error creating repo: {e}')
Uploading Files to Hugging Face Space (Docker Space)¶
# Defining the repo_id variable
repo_id = 'SpaceMonkey25/SuperKart-Sales-Predictor-Backend'
# Initialize the API
api = HfApi()
# Upload the backend files (Flask app, serialized model, requirements, and Dockerfile) to the Space
api.upload_folder(folder_path='backend_files', repo_id=repo_id, repo_type='space')
print('Files successfully uploaded to Hugging Face')
The URL for this Docker Space is: https://huggingface.co/spaces/SpaceMonkey25/SuperKart-Sales-Predictor-Backend
Deployment - Frontend¶
Points to note before executing the below cells¶
- Create a Streamlit space on Hugging Face by following the instructions provided on the content page titled Creating Spaces and Adding Secrets in Hugging Face from Week 1
Streamlit for Interactive UI¶
# Create a folder for storing the files needed for frontend UI deployment
os.makedirs("frontend_files", exist_ok=True)
%%writefile frontend_files/app.py
import streamlit as st
import requests
# Setting the title of the app
st.title("SuperKart Sales Prediction App")
# Adding a small description
st.write("This tool predicts sales based on product attributes. Please enter them below.")
# Input fields for product and store data
Product_Weight = st.number_input("Product Weight", min_value=0.0, value=12.66)
Product_Type = st.selectbox("Product Type", ['Fruits and Vegetables', 'Snack Foods', 'Baking Goods', 'Canned', 'Dairy', 'Frozen Foods', 'Health and Hygiene', 'Household', 'Meat', 'Soft Drinks', 'Breads', 'Breakfast', 'Hard Drinks', 'Others', 'Seafood', 'Starchy Foods'])
Product_MRP = st.number_input("Product MRP", min_value=0.0, value=147.03)
Store_Id = st.selectbox("Store Id", ['OUT001', 'OUT002', 'OUT003', 'OUT004'])
Store_Size = st.selectbox("Store Size", ['Small', 'Medium', 'High'])
Store_Location_City_Type = st.selectbox("Store Location City Type", ['Tier 1', 'Tier 2', 'Tier 3'])
Product_Allocated_Area = st.number_input("Product Allocated Area", min_value=0.0, value=0.07)
Product_Id = st.text_input("Product Id")
# Adding necessary conversion variables
Allocation_Conversion = {'OUT001': 109.066, 'OUT002': 78.045, 'OUT003': 92.591, 'OUT004': 323.073}
Power_Item = ['NC6961', 'FD576', 'FD320', 'FD432', 'FD179', 'FD193', 'DR7730',
'FD5491', 'FD875', 'NC475', 'FD5447', 'NC548', 'NC278', 'NC208',
'NC580', 'FD340', 'FD414', 'NC304', 'NC413', 'NC114', 'FD4812',
'FD70', 'NC469', 'DR444', 'FD2017', 'FD1824', 'FD4966', 'FD172',
'FD6473', 'FD4825', 'DR977', 'NC527', 'FD2295', 'FD4450', 'FD21',
'FD560', 'NC59', 'FD182', 'FD581', 'FD188', 'NC4618', 'FD565',
'NC551', 'FD2817', 'FD105', 'FD422', 'FD445', 'NC446', 'FD189',
'DR7', 'FD135', 'NC9', 'FD185', 'FD6267', 'FD76', 'FD3633',
'DR4259', 'FD4857', 'FD7168', 'FD5764', 'FD87', 'FD498', 'NC418',
'FD1470', 'FD51', 'DR332', 'FD22', 'DR506', 'NC249', 'NC3594',
'FD2233', 'FD350', 'FD113', 'FD43', 'DR312', 'NC761', 'FD122',
'FD8676', 'FD585', 'NC6327', 'FD578', 'FD237', 'NC238', 'FD274',
'FD253', 'FD8729', 'FD134', 'FD44', 'FD473', 'FD492', 'FD24',
'FD1012', 'FD117', 'FD472', 'NC63', 'FD479', 'FD41', 'NC380',
'DR6796', 'FD561', 'FD507', 'FD4', 'FD233', 'FD5072', 'FD8550',
'FD154', 'FD589', 'NC365', 'FD146', 'NC553', 'NC960', 'NC1564',
'FD802', 'NC101', 'FD7421', 'NC7132', 'DR58', 'FD7638', 'FD2194',
'NC64', 'FD84', 'FD81', 'FD452', 'DR23', 'FD1103', 'FD545',
'NC493', 'NC220', 'FD5985', 'FD541', 'DR325', 'FD337', 'NC570',
'FD27', 'FD180', 'FD62', 'FD42', 'FD225', 'FD554', 'FD8662',
'FD509', 'NC417', 'DR6167', 'FD271', 'NC2358', 'FD166', 'FD178',
'FD309', 'FD6358', 'DR49', 'NC272', 'FD3905', 'NC361', 'DR8580',
'FD2336', 'FD162', 'NC7325', 'FD79', 'FD3771', 'FD2958', 'FD787',
'NC36', 'FD531', 'FD525', 'FD335', 'FD521', 'FD344', 'FD7676',
'NC892', 'FD1714', 'FD533', 'FD152', 'FD2314', 'DR4889', 'DR5085',
'FD366', 'FD6202', 'FD5215', 'FD6353', 'FD25', 'FD8499', 'DR496',
'FD300', 'FD373', 'FD73', 'FD75', 'FD169', 'NC216', 'FD3936',
'FD5303', 'DR591', 'FD8576', 'FD245', 'FD364', 'FD224', 'FD308',
'NC53', 'FD4011', 'FD5564', 'FD270', 'FD3595', 'FD590', 'FD406',
'FD2575', 'FD227', 'FD425', 'FD4262', 'FD286', 'FD4217', 'FD111',
'NC568', 'FD8655', 'FD3310', 'FD264', 'FD461', 'NC231', 'FD415',
'FD6', 'FD4413', 'NC574', 'FD333', 'FD4706', 'NC437', 'FD69',
'NC466', 'FD317', 'FD14', 'FD8199', 'FD7195', 'FD3474', 'FD303',
'FD262', 'FD110', 'FD597', 'FD311', 'FD1636', 'FD104', 'NC356',
'FD3041', 'FD5038', 'FD8404', 'FD4830', 'FD184', 'FD2818',
'FD8713', 'FD126', 'FD5274', 'NC559', 'FD517', 'NC273', 'FD301',
'FD56', 'FD899', 'FD571', 'FD500', 'FD558', 'DR421', 'FD130',
'FD6775', 'FD2409', 'FD136', 'FD206', 'FD239', 'FD725', 'FD487',
'NC386', 'FD202', 'FD2362', 'NC358', 'FD174', 'FD546', 'FD181',
'DR217', 'NC583', 'FD4498', 'FD236', 'FD132', 'FD5823', 'FD412',
'FD4741', 'FD323', 'FD141', 'FD97', 'DR3464', 'FD183', 'FD1519',
'FD149', 'FD107', 'FD966', 'FD1348', 'FD2301', 'FD5', 'FD293',
'FD8203', 'FD394', 'FD3953', 'FD460', 'FD5502', 'FD6506', 'FD427',
'NC3879', 'FD5614', 'FD398', 'FD7468', 'FD481', 'FD277', 'FD143',
'DR528', 'FD7383', 'FD191', 'FD68', 'NC342', 'NC434', 'NC214',
'FD8102', 'FD505', 'FD313', 'FD18', 'FD102', 'NC584']
product_data = {
'Product_Weight': Product_Weight,
'Product_Type': Product_Type,
'Product_MRP': Product_MRP,
'Store_Id': Store_Id,
'Store_Size': Store_Size,
'Store_Location_City_Type': (
1 if Store_Location_City_Type == 'Tier 1'
else 2 if Store_Location_City_Type == 'Tier 2'
else 3
),
'Scaled_Product_Allocation': Product_Allocated_Area / Allocation_Conversion[Store_Id],
'Power_Item': 1 if Product_Id in Power_Item else 0
}
if st.button("Predict", type='primary'):
response = requests.post("https://SpaceMonkey25-SuperKart-Sales-Predictor-Backend.hf.space/v1/predict", json=product_data)
if response.status_code == 200:
result = response.json()
predicted_sales = result["Sales"]
st.write(f"Predicted Product Store Sales Total: ₹{predicted_sales:.2f}")
else:
st.error("Error in API request")
Dependencies File¶
%%writefile frontend_files/requirements.txt
requests==2.32.3
streamlit==1.45.0
Dockerfile¶
%%writefile frontend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim
# Set the working directory inside the container to /app
WORKDIR /app
# Copy all files from the current directory on the host to the container's /app directory
COPY . .
# Install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt
# Define the command to run the Streamlit app on port 8501 and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]
# NOTE: Disable XSRF protection for easier external access in order to make batch predictions
Uploading Files to Hugging Face Space (Streamlit Space)¶
repo_id = 'SpaceMonkey25/SuperKart-Sales-Predictor-Frontend'
# Initialize the API
api = HfApi()
# Upload the Streamlit app files stored in the frontend_files folder
api.upload_folder(folder_path='frontend_files', repo_id=repo_id, repo_type='space')
The URL for the frontend UI space is: https://huggingface.co/spaces/SpaceMonkey25/SuperKart-Sales-Predictor-Frontend
Actionable Insights and Business Recommendations¶
Price and Location Drive Sales: Price and store location remain the strongest predictors of future sales.
Product Weight Matters: Product weight significantly impacts sales, likely due to its link with price.
Focus on Power Items: Identify and prioritize “Power Items” — top-performing products that drive category revenue.
Fix Allocation Data: Current product allocation space data appears outdated and inconsistent. Clean-up is essential for reliable insights.
Optimize Store Space: With accurate Power Item and allocation data, optimize space to boost sales efficiency and revenue.
Leverage Tier 1 Pricing: Tier 1 locations outperform others. Review price elasticity to capture untapped margin opportunities.
Adjust for Store Performance: OUT004 consistently overperforms (plan +4%), while OUT002 underperforms (plan –4%) for better forecasting and resource allocation.